Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor deferrable mode for BeamRunPythonPipelineOperator and BeamRunJavaPipelineOperator #46678

Merged
merged 1 commit into from
Feb 20, 2025

Conversation

MaksYermak
Copy link
Contributor

@MaksYermak MaksYermak commented Feb 12, 2025

In this PR I have refactored deferrable mode for Dataflow runner for BeamRunPythonPipelineOperator and BeamRunJavaPipelineOperator to be able to submit the Job using Apache Beam from a worker, and then use the Dataflow API to monitor the Job's status from the triggerer.

In the current implementation the operator, in deferrable mode, uses Apache Beam for submitting and monitoring Job from the triggerer, and this implementation confuses our users. Users try to use deferrable mode to optimize resources, but the current implementation does not optimize any it just switches Beam's running processes from worker to triggerer.

By this PR I have changed this logic for Dataflow's runner, because for Dataflow's runner it's possible to change Job's state monitoring process from Apache Beam to Dataflow API. In my implementation the operator starts the Job using Apache Beam on the worker then using is_dataflow_job_id_exist_callback wait until the Job provides dataflow_job_id. After that the operator leaves the Apache Beam waiting process and then starts a Trigger on triggerer for monitoring Job's status using Dataflow API.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added provider:apache-beam provider:google Google (including GCP) related issues labels Feb 12, 2025
@MaksYermak MaksYermak force-pushed the refactor-apache-beam-def-mode branch 4 times, most recently from b6e3815 to 44ea967 Compare February 13, 2025 16:09
Copy link
Contributor

@eladkal eladkal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR doesn't explain what is refactored and why do we need it. Does this change affect users in any way?

@MaksYermak
Copy link
Contributor Author

The PR doesn't explain what is refactored and why do we need it. Does this change affect users in any way?

@eladkal I have updated the PR's description for providing more information why this changes were done.

@MaksYermak MaksYermak force-pushed the refactor-apache-beam-def-mode branch from 44ea967 to cb5ae50 Compare February 14, 2025 10:30
@MaksYermak MaksYermak force-pushed the refactor-apache-beam-def-mode branch from cb5ae50 to 0e879ec Compare February 17, 2025 09:04
@VladaZakharova
Copy link
Contributor

Hi @eladkal @potiuk
Can you please check this PR? It's quite important one :)

@potiuk potiuk merged commit ee68ddf into apache:main Feb 20, 2025
61 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
provider:apache-beam provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants